Abstract

We describe an adaptation and application 
of a search-based structured prediction al- 
gorithm "Searn" to unsupervised learning 
problems. We show that it is possible to 
reduce unsupervised learning to supervised 
learning and demonstrate a high-quality un- 
supervised shift-reduce parsing model. We 
additionally show a close connection between 
unsupervised Searn and expectation maxi- 
mization. Finally, we demonstrate the effi- 
cacy of a semi-supervised extension. The key 
idea that enables this is an application of the
predict-self idea for unsupervised learning. 

1. Introduction 

A prevalent and useful version of unsupervised learn- 
ing arises when both the observed data and the la- 
tent variables are structured. Examples range from 
hidden alignment variables in speech recognition (Ra- 
biner, 1989) and machine translation (Brown et al., 
1993; Vogel et al., 1996), to latent trees in unsuper- 
vised parsing (Paskin, 2001; Klein & Manning, 2004; 
Smith & Eisner, 2005; Titov & Henderson, 2007), and 
to pose estimation in computer vision (Ramanan et al., 
2005). These techniques are all based on probabilistic 
models. Their applicability hinges on the tractability 
of (approximately) computing latent variable expecta- 
tions, thus enabling the use of EM (Dempster et al., 
1977). In this paper we show that a recently-developed 
search-based algorithm, Searn (Daume III et al., 2009 
to appear) (see Section 2.2), can be utilized for unsu- 
pervised structured prediction (Section 3). We show: 
(1) that under an appropriate construction, Searn can 
imitate expectation maximization (Section 4); (2)
that unsupervised Searn can be used to obtain com- 
petitive performance on an unsupervised dependency 
parsing task (Section 6); and (3) that unsupervised
Searn naturally extends to a semi-supervised setting
(Section 7). The key insight that enables this work is 
that we can consider the prediction of the (observed) 
input to be, itself, a structured prediction problem. 

2. Structured Prediction 

The supervised structured prediction problem is the
task of mapping inputs x to complex structured outputs
y (e.g., sequences, trees, etc.). Formally, let $\mathcal{X}$
be an arbitrary input space and $\mathcal{Y}$ be a structured output
space. $\mathcal{Y}$ is typically assumed to decompose over some
smaller substructures (e.g., labels in a sequence). $\mathcal{Y}$
comes equipped with a loss function, often assumed
to take the form of a Hamming loss over the substructures.
Features are defined over pairs (x, y) in
such a way that they obey the substructures (e.g., one 
might have features over adjacent label pairs in a se- 
quence). Under strong assumptions on the structures, 
the loss function and the features (essentially "local- 
ity" assumptions), a number of learning algorithms 
can be employed: for example, conditional random 
fields (Lafferty et al., 2001) or max-margin Markov 
networks (Taskar et al., 2005). 

A key difficulty in structured prediction occurs when
the output space $\mathcal{Y}$, the features, or the loss do not
decompose nicely. Any of these issues can lead to intractable
computations at training time, at prediction time, or
(often) both. An attractive approach for dealing
with this intractability is to employ a search-based
algorithm. The key idea in search-based structured
prediction is to first decompose the output y into a sequence
of (dependent) smaller predictions $y_1, \ldots, y_T$.
These may each be predicted in turn, with later predictions
dependent on previous decisions.

2.1. Search-based Structured Prediction

A recently proposed algorithm for solving the struc- 
tured prediction problem is Searn (Daume III et al., 
2009 to appear). Searn operates by considering each
substructure prediction $y_1, \ldots, y_T$ as a classification
problem. A classifier h is trained so that at time t,
given a feature vector, it predicts the best value for $y_t$.
The feature vector can be based on any part of the
input x and any previous decisions $y_1, \ldots, y_{t-1}$. This
introduces a chicken-and-egg problem: h should ideally
be trained so that it makes the best decision for $y_t$
given that h makes all past decisions $y_1, \ldots, y_{t-1}$ and
all future decisions $y_{t+1}, \ldots, y_T$. Of course, at training
time we do not have access to h (we are trying
to construct it). The solution is to use an iterative
scheme.

2.2. Searn 

The presentation we give here differs slightly from the
original presentation of the Searn algorithm. We stray
from the original formulation because our presentation
makes clearer the connection
between our unsupervised variant of Searn and
more standard unsupervised learning methods (such
as standard algorithms on hidden Markov models).

Let $\mathcal{D}^{SP}$ denote a distribution over pairs (x, y) drawn
from $\mathcal{X} \times \mathcal{Y}$, and let $\ell(y, \hat{y})$ be the loss associated with
predicting $\hat{y}$ when the true answer is y. We assume
that $y \in \mathcal{Y}$ can be decomposed into atomic predictions
$y_1, \ldots, y_T$, where each $y_t$ is drawn from a discrete set
Y. A policy, $\pi$, is a (possibly stochastic) function that
maps tuples $(x, y_1, \ldots, y_{t-1})$ to atomic predictions $y_t$.

The key ingredient in Searn is to use the loss function
$\ell$ and a "current" policy $\pi$ to turn $\mathcal{D}^{SP}$ into a
distribution over cost-sensitive (multiclass) classification
problems (Beygelzimer et al., 2005). A cost-sensitive
classification example is given by an input x and a cost
vector $c = (c_1, \ldots, c_K)$, where $c_k$ is the cost of predicting
class k on input x. Define by Searn($\mathcal{D}^{SP}$, $\ell$, $\pi$) a
distribution over cost-sensitive classification problems
derived as follows. To sample from this induced distribution,
we first sample an example $(x, y) \sim \mathcal{D}^{SP}$.
We then sample t uniformly from [1, T] and run $\pi$ for
$t - 1$ steps on (x, y). This yields a partial prediction
$(\hat{y}_1, \ldots, \hat{y}_{t-1})$. The input for the cost-sensitive classification
problem is then the tuple $(x, \hat{y}_1, \ldots, \hat{y}_{t-1})$. The
costs are derived as follows. For each possible choice k
of $\hat{y}_t$, we define $c_k$ as the expected loss if $\pi$ were run,
beginning at $(\hat{y}_1, \ldots, \hat{y}_{t-1}, k)$, on input x. Formally:

$$c_k = \mathbb{E}_{\hat{y}_{t+1}, \ldots, \hat{y}_T \sim \pi} \; \ell\big(y, (\hat{y}_1, \ldots, \hat{y}_{t-1}, k, \hat{y}_{t+1}, \ldots, \hat{y}_T)\big) \qquad (1)$$
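The cost vector of Eq (1) can be estimated by rolling out the current policy after each candidate action. The following is a minimal sketch under stated assumptions, not the original implementation: the names `rollout_costs` and `hamming`, the policy signature `policy(x, prefix, rng)`, and the use of a shared random seed across actions are all illustrative choices.

```python
import random

def hamming(y, y_hat):
    """Number of positions where the prediction differs from the truth."""
    return sum(a != b for a, b in zip(y, y_hat))

def rollout_costs(x, y_true, prefix, actions, policy, loss, T, n_samples=1, seed=0):
    """Monte Carlo estimate of c_k for every candidate action k.

    prefix  : the partial prediction (y_1, ..., y_{t-1}) already made
    policy  : callable (x, prefix, rng) -> next atomic prediction (assumed signature)
    Returns a dict mapping each action k to its estimated expected loss.
    """
    costs = {}
    for k in actions:
        total = 0.0
        for s in range(n_samples):
            # Reuse the same seed for every action so that all actions are
            # compared under the same random rollouts.
            rng = random.Random(seed + s)
            y_hat = list(prefix) + [k]
            while len(y_hat) < T:
                y_hat.append(policy(x, y_hat, rng))
            total += loss(y_true, y_hat)
        costs[k] = total / n_samples
    return costs
```

With a deterministic policy a single sample is exact; with a stochastic policy, `n_samples` trades computation for variance.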

Searn assumes access to an "initial policy" $\pi^*$ (sometimes
called the "optimal policy"). Given an input x,
a true output y and a prefix of predictions $\hat{y}_1, \ldots, \hat{y}_{t-1}$,
$\pi^*$ produces a best next action, $\hat{y}_t$. It should be constructed
so that the choice $\hat{y}_t$ is optimal (or close to
optimal) with respect to the problem-specific loss function.
For example, if the loss function is Hamming loss,



Algorithm SEARN-Learn($A$, $\mathcal{D}^{SP}$, $\ell$, $\pi^*$, $\beta$)

1: Initialize $\pi = \pi^*$

2: while not converged do

3:    Sample: $D \sim$ SEARN($\mathcal{D}^{SP}$, $\ell$, $\pi$)

4:    Learn: $h \leftarrow A(D)$

5:    Update: $\pi \leftarrow (1 - \beta)\pi + \beta h$

6: end while

7: Return $\pi$ without reference to $\pi^*$



Figure 1. The complete Searn algorithm. Its parameters
are: a cost-sensitive classification algorithm A, a distribution
over structured problems $\mathcal{D}^{SP}$, a loss function $\ell$, an
initial policy $\pi^*$ and an interpolation parameter $\beta$.

$\pi^*$ will always produce $\hat{y}_t = y_t$. For more complex
loss functions, computing $\pi^*$ may be more involved.

Given these ingredients, Searn operates according to the
algorithm given in Figure 1. Operationally, the sampling
step is typically implemented by generating every
example from a fixed structured prediction training
set. The costs (expected losses) are computed by
sampling with tied randomness (Ng & Jordan, 2000).
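The loop of Figure 1 can be sketched as follows. This is an illustration under assumptions rather than the reference implementation: it assumes the update step interpolates the current policy with the newly learned classifier h using the parameter β, it represents the stochastic policy as a weighted mixture, it replaces the convergence test with a fixed iteration count, and it reads "return without reference to the initial policy" as dropping that component and renormalizing. The names `searn_learn`, `learn` and `make_examples` are hypothetical.

```python
import random

def searn_learn(learn, make_examples, pi_star, beta, n_iters):
    """Sketch of the Searn loop.

    learn         : callable mapping a list of cost-sensitive examples to a classifier h
    make_examples : callable mapping the current policy to a list of examples
                    (stands in for sampling from the induced distribution)
    pi_star       : initial policy, callable (x, prefix, rng) -> action
    Returns a list of (weight, classifier) pairs describing the learned mixture.
    """
    if n_iters < 1:
        raise ValueError("need at least one iteration to learn a classifier")
    mixture = [(1.0, pi_star)]          # pi = pi_star

    def current_policy(x, prefix, rng):
        # Pick one mixture component at random, then let it act.
        r, acc = rng.random(), 0.0
        for weight, component in mixture:
            acc += weight
            if r < acc:
                return component(x, prefix, rng)
        return mixture[-1][1](x, prefix, rng)

    for _ in range(n_iters):
        data = make_examples(current_policy)   # Sample
        h = learn(data)                        # Learn
        # Update: shrink old components by (1 - beta), add h with weight beta.
        mixture[:] = [(w * (1.0 - beta), p) for w, p in mixture] + [(beta, h)]

    # Drop the initial policy and renormalize the remaining weights.
    learned = [(w, p) for w, p in mixture if p is not pi_star]
    z = sum(w for w, _ in learned)
    return [(w / z, p) for w, p in learned]
```

Keeping the policy as an explicit mixture makes the role of β visible: after each iteration the initial policy's weight decays geometrically.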

If $\beta = 1/T^3$, one can show (Daume III et al., 2009 to
appear) that after at most $2T^3 \ln T$ iterations, Searn
is guaranteed to find a solution $\pi$ with structured prediction
loss bounded as:

$$L(\pi) \le L(\pi^*) + 2\,\ell_{avg}\, T \ln T + c\,(1 + \ln T)/T \qquad (2)$$

where $L(\pi^*)$ is the loss of the initial policy (typically
zero), T is the length of the longest example, c is the
worst-case per-step loss and $\ell_{avg}$ is the average multiclass
classification loss. This shows that the structured
prediction algorithm learned by Searn is guaranteed
to be not much worse than that produced by the initial
policy, provided that the created classification problems
are easy (i.e., that $\ell_{avg}$ is small). Note that one
can use any classification algorithm one likes.
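To get a feel for how the bound behaves, it can be evaluated numerically. The sketch below assumes the bound is read as the initial policy's loss plus 2 ℓ_avg T ln T plus c (1 + ln T) / T; the function name `searn_bound` and its argument names are illustrative.

```python
import math

def searn_bound(loss_init, l_avg, c_max, T):
    """Right-hand side of the loss bound, under the reading stated above.

    loss_init : loss of the initial policy (typically zero)
    l_avg     : average multiclass classification loss
    c_max     : worst-case per-step loss
    T         : length of the longest example (T >= 1)
    """
    return loss_init + 2.0 * l_avg * T * math.log(T) + c_max * (1.0 + math.log(T)) / T
```

The middle term grows as T ln T, so the guarantee is only informative when the induced classification problems are easy enough that the average classification loss is small relative to 1 / (T ln T).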

3. Unsupervised Searn

In unsupervised structured prediction, we no longer receive
a pair (x, y) but instead observe only an input
x. Our job is to construct a classifier that produces y,
even though we have never observed it.

3.1. Reduction for Unsupervised to Supervised 

The key idea — one that underlies much work in unsu- 
pervised learning — is that a good y is one that enables 
us to easily recover x. This is precisely the intuition 
we build into our model. The observation that makes
this practical is that there is nothing in the theory
or application of Searn that says that $\pi^*$ cannot be
stochastic. Moreover, there is no requirement that the
loss function depend on all components of the prediction.
Our model will essentially first predict y and then
predict x based on y. Importantly, the loss function is
agnostic to y (since we do not have true outputs).

The general construction is as follows. Let $\mathcal{D}^{unsup}$ be a
distribution over inputs $x \in \mathcal{X}$ and let $\mathcal{Y}$ be the space
of desired latent structures (e.g., trees). We define a
distribution $\mathcal{D}^{sup}$ over $\mathcal{X} \times (\mathcal{Y} \times \mathcal{X})$ by defining a sampling
procedure. To sample from $\mathcal{D}^{sup}$, we first sample
$x \sim \mathcal{D}^{unsup}$. We then sample y uniformly from the set
of all $y \in \mathcal{Y}$ that are valid structures for x. Finally, we return
the pair $(x, (y, x))$. We define a loss function L by
$L((y, x), (\hat{y}, \hat{x})) = L^{input}(x, \hat{x})$, where $L^{input}$ is any loss
function on the input space (e.g., Hamming loss). We
apply Searn to the supervised structured prediction
problem $\mathcal{D}^{sup}$, and implicitly learn latent structures.
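The construction above can be sketched in a few lines. This is a minimal illustration under assumptions: `sample_unsup` and `valid_structures` are hypothetical callables standing in for the unlabeled data distribution and the set of valid latent structures, and outputs are represented as `(y, x)` tuples.

```python
import random

def sample_sup(sample_unsup, valid_structures, rng):
    """Draw one supervised example (x, (y, x)) from the induced distribution.

    sample_unsup     : callable rng -> x, draws an unlabeled input
    valid_structures : callable x -> list of candidate latent structures for x
    """
    x = sample_unsup(rng)
    y = rng.choice(valid_structures(x))   # uniform over valid structures
    return x, (y, x)

def unsup_loss(true_out, pred_out, input_loss):
    """Loss on (y, x) pairs that ignores the latent part and scores only x."""
    (_, x), (_, x_hat) = true_out, pred_out
    return input_loss(x, x_hat)
```

Because `unsup_loss` never inspects the latent component, the randomly drawn y in the "true" output has no effect on the costs; it only serves to make the example well-formed.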

3.2. Sequence Labeling Example

To gain insight into the operation of Searn in the
unsupervised setting, it is useful to consider a sequence
labeling example. That is, our input x is a sequence
of length T and we desire a label sequence y of length
T drawn from a label space of size K. We convert
this into a supervised learning problem by considering
the "true" structured output to be a label sequence
of length 2T, with the first T components drawn from
the label space of size K and the second T components
drawn from the input vocabulary. The loss function
can then be anything that depends only on the last
T components. For simplicity, we can consider it to
be Hamming loss. The construction of the optimal
policy in this case is straightforward. For the first T
components, $\pi^*$ may behave arbitrarily (e.g., it may
produce a uniform distribution over the K labels). For
the second T components, $\pi^*$ always predicts the true
label (which is known, because it is part of the input).
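The initial policy for this sequence labeling construction is simple enough to write down directly. The sketch below is illustrative: the function name `initial_policy`, the zero-based indexing, and the representation of labels as integers in `range(K)` are assumptions.

```python
import random

def initial_policy(x, prefix, rng, K):
    """Initial policy over an output of length 2T (T = len(x)).

    The first T decisions are latent labels, chosen uniformly at random.
    The last T decisions reproduce the input, which is known.
    """
    T = len(x)
    t = len(prefix)            # zero-based index of the next decision
    if t < T:
        return rng.randrange(K)
    return x[t - T]
```

Running this policy for 2T steps yields a random label sequence followed by an exact copy of x, so its loss under any loss that depends only on the last T components is zero.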

An important aspect of the model is the construction 
of the feature vectors. It is most useful to consider this 
construction as having two parts. The first part has 
to do with predicting the hidden structure (the first T 
components). The second part has to do with predict- 
ing the observed structure (the second T components). 
For the first part, we are free to use whatever features 
we desire, so long as they can be computed based on 
the input x and a partial output. For instance, in the 
HMM case, we could use the two most recent label 
predictions and windowed features from x. 

The construction of the features for the second part is,
however, also crucial. For instance, if the feature vector
corresponding to "predict the t-th component of x"
contains the t-th component of x, then this learning problem
is trivial, but the latent structure is also rendered
useless. The goal of the designer of the feature space
is to construct features for predicting $x_t$ that crucially
depend on getting the latent structure y correct. That
is, the ideal feature set is one for which we can predict
$x_t$ accurately if and only if we have found the correct
latent structure (more on this in Section 5). For instance,
in the HMM case, we may predict $x_t$ based
only on the corresponding label $y_t$, or maybe on the
basis of $y_{t-1}, y_t, y_{t+1}$. (Note that we are not limited
to the Markov assumption, as in the case of HMMs.)

In the first iteration of Searn, all costs for the predic- 
tion of the latent structure are computed with respect 
to the initial policy. Recalling that the initial policy 
behaves randomly when predicting the latent labels 
and correctly when predicting the words, we can see 
that these costs are all zero. Thus, for the latent struc- 
ture actions, Searn will not induce any classification 
examples (because the cost of all actions is equal). 
However, it will create examples for predicting the x
component. For predicting the x's, the cost will be zero
for the correct word and one for any incorrect word.
These examples will have associated features: we will
predict word $x_t$ based exclusively on $y_t$. Remember:
$y_t$ was generated randomly by the initial policy.

In the second iteration, the behavior is different.
Searn returns to creating examples for the latent
structure components. However, in this iteration,
since the current policy is no longer optimal, the future
cost estimates may be non-zero. Consider generating
an example corresponding to a (latent) state
$y_t$. For some small percentage (as dictated by $\beta$) of the
"generate x" decisions, the previously learned classifier
will fire. If this learned classifier does well, then the
associated cost will be low. However, if the learned
classifier does poorly, then the associated cost will be
high. Intuitively, the learned classifier will do well if
and only if the action that labels $y_t$ is "good" (i.e.,
consistent with what was learned previously). Thus, in
the second pass through the data, Searn does create
classification examples specific to the latent decisions.

As Searn iterates, more and more of the latent pre- 
diction decisions are made according to the learned 
classifiers and not with respect to the random policy. 

4. Comparison to EM

In this section, we show an equivalence between ex- 
pectation maximization in directed probabilistic struc- 
tures and unsupervised Searn. We use mixture of 
multinomials as a motivating example (primarily for 
simplicity), but the results easily extend to more complicated
models (e.g., HMMs: see Section 4.3).

4.1. EM for Mixture of Multinomials 

In the mixture of multinomials problem, we are given
N documents $d_1, \ldots, d_N$, where $d_n$ is a vector of word
counts over a vocabulary of size V; that is, $d_{n,v}$ is
the number of times word v appeared in document n.
The mixture of multinomials is a probabilistic clustering
model, where we assume an underlying set of K
clusters (multinomials) that generated the documents.
Denote by $\theta_k$ the multinomial parameter associated
with cluster k, $\rho_k$ the prior probability of choosing
cluster k, and let $z_n$ be an indicator vector associating
document n with the unique cluster k such that
$z_{n,k} = 1$. The probabilistic model has the form:



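$$p(d \mid \theta, \rho) = \prod_n \left[ \frac{\left(\sum_v d_{n,v}\right)!}{\prod_v d_{n,v}!} \sum_{z_n} \prod_k \left( \rho_k \prod_v \theta_{k,v}^{d_{n,v}} \right)^{z_{n,k}} \right] \qquad (3)$$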

Expectation maximization in this model involves first
computing expectations over the z vectors and then
updating the model parameters $\theta$ and $\rho$:



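$$\text{E-step:} \quad z_{n,k} \propto \rho_k \prod_v \theta_{k,v}^{d_{n,v}} \qquad (4)$$

$$\text{M-step:} \quad \theta_{k,v} \propto \sum_n z_{n,k}\, d_{n,v} \; ; \quad \rho_k \propto \sum_n z_{n,k} \qquad (5)$$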

In both cases, the constant of proportionality is chosen 
so that the variables sum to one over the last compo- 
nent. These updates are repeated until convergence of 
the incomplete data likelihood, Eq (3). 
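The two updates can be written out in plain Python. The sketch below reads the E-step as setting each responsibility proportional to the cluster prior times the product of word probabilities raised to the word counts, and the M-step as re-estimating word probabilities and priors from responsibility-weighted counts. The function names, the log-space computation, and the small smoothing constant (added to avoid taking the log of zero) are assumptions of this sketch.

```python
import math

def e_step(docs, theta, rho):
    """Responsibilities z[n][k], normalized over clusters k for each document n."""
    z = []
    for d in docs:
        logp = [
            math.log(rho[k]) + sum(c * math.log(theta[k][v]) for v, c in enumerate(d) if c)
            for k in range(len(rho))
        ]
        m = max(logp)                              # subtract max for numerical stability
        w = [math.exp(lp - m) for lp in logp]
        s = sum(w)
        z.append([wi / s for wi in w])
    return z

def m_step(docs, z, smooth=1e-9):
    """Re-estimate theta (normalized over words) and rho (normalized over clusters)."""
    K, V, N = len(z[0]), len(docs[0]), len(docs)
    theta = []
    for k in range(K):
        counts = [smooth + sum(z[n][k] * docs[n][v] for n in range(N)) for v in range(V)]
        total = sum(counts)
        theta.append([c / total for c in counts])
    mass = [smooth + sum(z[n][k] for n in range(N)) for k in range(K)]
    rho = [m / sum(mass) for m in mass]
    return theta, rho
```

Alternating `e_step` and `m_step` from an asymmetric initialization of `theta` is enough to separate documents with clearly different word distributions.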

4.2. An Equivalent Model in Searn 

Now, we show how to construct an instance of unsu- 
pervised Searn that effectively mimics the behavior 
of EM on the mixture of multinomials problem. The 
ingredients are as follows: 

• The input space X is the space of documents, repre- 
sented as word count vectors. 

• The (latent) output space y is a single discrete vari- 
able in the range [1, K] that specifies the cluster. 

• The feature set for predicting y is x (the document counts).

• The feature set for predicting x is the label y and the
total number of words in the document. The predictions
for a document are estimated word probabilities,
not the words themselves.

• The loss function ignores the prediction y and returns
the log loss of the true document x under the word
probabilities predicted.

• The cost-sensitive learning algorithm is different de- 
pending on whether the latent structure y is being 
predicted or if the document x is being predicted: 



Structure: The base classifier is a multinomial 
naive Bayes classifier, parameterized by (say) h'" 

Document: The base classifier is a collection 
of independent maximum likelihood multinomial 
estimators for each cluster. 



Consider the behavior of this setup. In particular, consider
the distribution Searn($\mathcal{D}^{SP}$, $\ell$, $\pi$). There are two
"types" of examples drawn from this distribution: (1) 
latent structure examples and (2) document examples. 
The claim is that both classifiers learned are identical 
to the mixture of multinomials model from Section 4.1. 

Consider the generation of a latent structure example.
First, a document n is sampled uniformly from
the training set. Then, for each possible label k of this
document, a cost $\mathbb{E}_{\hat{d} \sim \pi}\, \ell((y, d_n), (k, \hat{d}))$ is computed.
By definition, the $\hat{d}$ that is computed is exactly the
prediction according to the current multinomial estimator,
$h^m$. Interpreting the multinomial estimator in
terms of the EM parameters, the costs are precisely the
$z_{n,k}$'s from EM (see Eq (4)). These latent structure examples
are fed into the multinomial naive Bayes classifier,
which re-estimates a model exactly as per the
M-step in EM (Eq (5)).

Next, consider the generation of the document examples.
These examples are generated by $\pi$ first choosing
a cluster according to the structure classifier. This
cluster id is then used as the (only) feature to the "generate
document" multinomial. As we saw before, the
probability that $\pi$ will select label k for document n
is precisely $z_{n,k}$ from Eq (4). Thus, the multinomial
estimator will effectively receive weighted examples,
weighted by these $z_{n,k}$'s, thus making the maximum
likelihood estimate exactly the same as the M-step
from EM (Eq (5)).

4.3. Synthetic experiments 

To demonstrate the advantages of the generality of 
Searn, we report here the result of some experiments 
on synthetic data. We generate synthetic data ac- 
cording to two different HMMs. The first HMM is 
a first-order model. The initial state probabilities, the 
transition probabilities, and the observation probabil- 
ities are all drawn uniformly. The second HMM is a 
second-order model, also with all probabilities drawn
uniformly. The lengths of observations are given by a 
Poisson with a fixed mean. 

In our experiments, we consider the following learning
algorithms: EM, Searn with HMM features and
a naive Bayes classifier, and Searn with a logistic
regression classifier (and an enhanced feature space:
predicting $y_t$ depends on $x_{t-1:t+1}$). The first Searn
should mimic EM, but by using sampling rather than
exact expectation computations. The models are all
first-order, regardless of the underlying process.

Table 1. Error rates on first- and second-order Markov data with 2, 5 or 10 latent states. Models are the true data
generating distribution (approximated by a first-order Markov model in the case of HMM2), a model learned by EM, one
learned by Searn with a naive Bayes base classifier, and one learned by Searn with a logistic regression base classifier.
Standard deviations are given in small text. The best results by row are bolded; the results within the standard deviation
of the best results are italicized.

Model           States   Truth          EM             Searn-NB       Searn-LR
1st order HMM   K = 2    0.227 ±0.107   0.275 ±0.128   0.287 ±0.138   0.276 ±0.095
1st order HMM   K = 5    0.687 ±0.043   0.678 ±0.026   0.688 ±0.025   0.672 ±0.022
1st order HMM   K = 10   0.806 ±0.035   0.762 ±0.021   0.771 ±0.019   0.755 ±0.019
2nd order HMM   K = 2    0.294 ±0.072   0.396 ±0.057   0.408 ±0.056   0.271 ±0.057
2nd order HMM   K = 5    0.651 ±0.068   0.695 ±0.027   0.710 ±0.016   0.633 ±0.018
2nd order HMM   K = 10   0.815 ±0.032   0.764 ±0.021   0.771 ±0.015   0.705 ±0.019

We run the following experiment. For a given number
of states (which we will vary), we generate 10 random
data sets according to each model. Each data set consists
of 5 examples with a mean example length of 40 observations.
The vocabulary size of the observed data is
always 10. We compute error rates by matching each
predicted label to the best-matching true label and then
computing Hamming loss. Forward-backward is initialized
randomly. We run experiments with the number
of latent states equal to 2, 5 and 10.^1
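The evaluation just described can be sketched as follows. The text does not say whether the matching is one-to-one; this sketch assumes a many-to-one mapping in which each predicted label is mapped to the true label it co-occurs with most often. The function name `matched_error_rate` is illustrative.

```python
from collections import Counter

def matched_error_rate(pred, true):
    """Hamming error after mapping each predicted label to its best-matching true label."""
    if len(pred) != len(true) or not true:
        raise ValueError("pred and true must be non-empty and of equal length")
    cooc = Counter(zip(pred, true))           # (predicted, true) co-occurrence counts
    best = {}
    for (p, t), count in cooc.items():
        # Keep the true label with the highest count; ties keep the first seen.
        if p not in best or count > cooc[(p, best[p])]:
            best[p] = t
    errors = sum(best[p] != t for p, t in zip(pred, true))
    return errors / len(true)
```

For example, with predictions `[0, 0, 1, 1, 1]` and true labels `["a", "a", "b", "b", "a"]`, label 0 maps to "a" and label 1 maps to "b", leaving a single mismatch in five positions.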

The results of the experiments are shown in Table 1.
The observations show two things. When the
true model matches the model we attempt to learn
(HMM1), there is essentially no statistically significant
difference between any of the algorithms. Where
one sees a difference is when the true model does not
match the learned model (HMM2). In this case, we see
that Searn-LR obtains a significant advantage over
both EM and Searn-NB, due to its ability to employ
a richer set of features. These results hold over all
values of K. This is encouraging, since in the real
world our model is rarely (if ever) right. The (not statistically
significant) difference in error rates between
EM and Searn-NB is due to sampling versus exact
computation of expectations. Many of the models
outperform "truth" because likelihood and accuracy
do not necessarily correlate (Liang & Klein, 2008).

5. Analysis 

There are two keys to success in unsupervised Searn.
The first key is that the features on the $\mathcal{Y}$-component
of the output space be descriptive enough that it be
learnable. One way of thinking of this constraint is
that if we had labeled data, then we would be able to
learn well. The second key is that the features on the
$\mathcal{X}$-component of the output space be intrinsically tied
to the hidden component. Ideally, these features will
be such that x can be predicted with high accuracy if
and only if y is predicted accurately.

^1 We ran experiments varying the number of samples
Searn uses in {1, 2, 5}; there was no statistically significant
difference. The results we report are based on 2 samples.

The general, though very trivial, result is that if we
can guarantee that the loss on $\mathcal{Y}$ is bounded by some
function f of the loss on $\mathcal{X}$, then the loss on $\mathcal{Y}$ is
guaranteed after learning to be bounded by
$f\big(L(\pi^*) + 2\,\ell_{avg}\, T_{max} \ln T_{max} + c\,(1 + \ln T_{max})/T_{max}\big)$, where all the
constants now depend on the induced structured prediction
problem; see Eq (2).

One can see the unsupervised Searn analysis as justifying
a small variant on "Viterbi training": the process
of performing EM where the E-step is approximated
with a delta function centered at the maximum.
One significant issue with Viterbi training is that it is 
not guaranteed to converge. However, Viterbi training 
is recovered as a special case of unsupervised Searn 
where the interpolation parameter is fixed at 1. While 
the Searn theorem no longer applies in this degen- 
erate case, any algorithm that uses Viterbi training 
could easily be retrofitted to simply make some de- 
cisions randomly. In doing so, one would obtain an 
algorithm that does have theoretical guarantees. 

6. Unsupervised Dependency Parsing 

The dependency formalism is a practical and linguistically
interesting model of syntactic structure. One
can think of a dependency structure for a sentence of
length T as a directed tree over a graph with T + 1
nodes: one node for each word plus a unique root
node. Edges point from heads to dependents. An example
dependency structure for a T = 7 word sentence
is shown in Figure 2. To date, unsupervised dependency
parsing has only been viewed in the context of
global probabilistic models specified over dependency
pairs (Paskin, 2001) or spanning trees (Klein & Manning,
2004; Smith & Eisner, 2005). However, there
is an alternative, popular method for producing dependency
trees in a supervised setting: shift-reduce
parsing (Nivre, 2003; Sagae & Lavie, 2005).

root John hit the ball with the bat

Figure 2. Dependency parse of a T = 7 word sentence.

6.1. Shift-reduce dependency parsing

Shift-reduce dependency parsing (Nivre, 2003) is a left-to-right
parsing algorithm that operates by maintaining
three state variables: a stack S, a current position
i and a set of arcs A. The algorithm begins with
$(S, i, A) = (\emptyset, 1, \emptyset)$: the stack and arc set are empty and
the current index is 1 (the first word). The algorithm
then proceeds through a series of actions until a final
state is reached. A final state is one in which i = T, at
which point the set A contains all dependency edges
for the parse. Denote by $i|I$ a stack with i at the head
and stack I at the tail. There are four actions:

LeftArc: $(t|S, i, A) \rightarrow (S, i, (i,t)|A)$, so long as there
does not exist an arc $(\cdot, t) \in A$. (Adds a left dependency
to the arc set between the word t at the top of
the stack and the word i at the current index.)

RightArc: $(t|S, i, A) \rightarrow (i|t|S, i+1, (t,i)|A)$, so long as
there is no arc $(\cdot, i) \in A$. (Adds a right dependency
between the top of the stack and the next input.)

Reduce: $(t|S, i, A) \rightarrow (S, i, A)$, so long as there does exist
an arc $(\cdot, t) \in A$. (Removes a word from the stack.)

Shift: $(S, i, A) \rightarrow (i|S, i+1, A)$. (Places the item on the stack.)
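The four transitions can be sketched as a small state machine. This is an illustration under assumptions, not the parser used in the experiments: states are `(stack, i, arcs)` triples with the top of the stack stored last, arcs are `(head, dependent)` pairs, Shift is read as pushing the current index i, and the root node and termination test are left out. The names `initial_state`, `has_head` and `apply_action` are hypothetical.

```python
def initial_state():
    """Empty stack, current index 1 (the first word), empty arc set."""
    return ((), 1, frozenset())

def has_head(arcs, word):
    """True if some arc (head, word) already exists."""
    return any(dep == word for _, dep in arcs)

def apply_action(state, action):
    """Apply one transition, raising ValueError if its precondition fails."""
    stack, i, arcs = state
    if action == "Shift":
        return (stack + (i,), i + 1, arcs)
    if not stack:
        raise ValueError(action + " requires a non-empty stack")
    t = stack[-1]                                   # top of the stack
    if action == "LeftArc" and not has_head(arcs, t):
        return (stack[:-1], i, arcs | {(i, t)})     # i becomes the head of t
    if action == "RightArc" and not has_head(arcs, i):
        return (stack + (i,), i + 1, arcs | {(t, i)})   # t becomes the head of i
    if action == "Reduce" and has_head(arcs, t):
        return (stack[:-1], i, arcs)
    raise ValueError("illegal action " + action + " in state " + repr(state))
```

For a three-word input such as "John hit ball" (indices 1 to 3), the sequence Shift, LeftArc, Shift, RightArc attaches both "John" and "ball" to "hit". Each action is one atomic prediction, which is what makes this transition system a natural fit for a search-based learner.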

This algorithm is guaranteed to terminate in at most
2T steps with a valid dependency tree (Nivre, 2003),
unlike standard probabilistic algorithms that have a 
time-complexity that is cubic in T (McDonald & Satta, 
2007). The advantage of the shift-reduce framework is 
that it fits nicely into Searn. However, until now, it 
has been an open question how to train a shift-reduce 
model in an unsupervised fashion. The techniques de- 
scribed in this paper give a solution to this problem. 

6.2. Experimental setup 

We follow the same experimental setup as Smith &
Eisner (2005), using data from the WSJ10 corpus (sentences
of length at most ten from the Penn Treebank
(Marcus et al., 1993)). The data is stripped of punctuation
and parsing depends on the part-of-speech tags,



Table 2. Accuracy on training and test data, plus number
of iterations for a variety of dependency parsing algorithms
(all unsupervised except for the last two rows).

Algorithm          Acc-Tr       Acc-Tst      # Iter
Rand-Gen           23.5 ±0.9    23.5 ±1.3    -
Rand-Searn         21.3 ±0.2    21.0 ±0.6    -
K+M:Rand-Init      23.6 ±3.8    23.6 ±4.3    63.3
K+M:Smart-Init     35.2 ±6.6    35.2 ±6.0    64.1
S+E:Length         33.8 ±3.6    33.7 ±5.9    173.1
S+E:DelOrTrans1    47.3 ±6.0    47.1 ±5.9    132.2
S+E:Trans1         48.8 ±0.9    49.0 ±1.5    173.4
Searn: Unsup       45.8 ±1.6    45.4 ±2.2    27.6
S+E: Sup           79.9 ±0.2    78.6 ±0.8    350.5
Searn: Sup         81.0 ±0.3    81.6 ±0.4    24.4



not the words. We use the same train/dev/test split
as Smith and Eisner: 5301 sentences of training data, 
531 sentences of development data and 530 sentences 
of blind test data. All algorithm development and tun- 
ing was done on the development data. 

We use a slight modification to SearnShell to facilitate
the development of our algorithm together with a
multilabel logistic regression classifier, MegaM.^2 Our
algorithm uses the following features for the tree-based 
decisions (inspired by (Hall et al., 2006)), where t is 
the top of the stack and i is the next token: the parts- 
of-speech within a window of 2 around t and i; the pair 
of tokens at t and i; the distance (discretized) between 
t and i; and the part-of-speech at the head (resp. tail) 
of any existing arc pointing to (resp. from) t or i. 
For producing word i, we use the part of speech of i's 
parent, grandparent, daughters and aunts. 

We use Searn with a fixed $\beta = 0.1$. One sample
is used to approximate expected losses. The devel- 
opment set is used to tune the scale of the prior vari- 
ances for the logistic regression (different variances are 
allowed for the "produce tree" and "produce words" 
features). The initial policy makes uniformly random 
decisions. Accuracy is directed arc accuracy. 
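Directed arc accuracy can be computed as the fraction of words whose predicted head matches the gold head. The sketch below assumes each parse is given as a list of head indices, one per word, with 0 standing for the root; that representation and the function name are illustrative choices, not something the text specifies.

```python
def directed_arc_accuracy(pred_heads, gold_heads):
    """Fraction of words whose predicted head equals the gold head."""
    if len(pred_heads) != len(gold_heads) or not gold_heads:
        raise ValueError("head lists must be non-empty and of equal length")
    correct = sum(p == g for p, g in zip(pred_heads, gold_heads))
    return correct / len(gold_heads)
```

Because the measure is directed, an arc with the right endpoints but the wrong direction counts as an error for the word whose head is wrong.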

6.3. Experimental results 

The baseline systems are: two random baselines (one
generative, one given by the Searn initial policy),
Klein and Manning's EM-based model (Klein & Manning, 2004)
(with and without clever initialization),
and three variants of Smith and Eisner's model
(Smith & Eisner, 2005) (with random initialization,
which seems to be better for most of their models).
We also report an "upper bound" performance
based on supervised training, for both the probabilistic
(Smith+Eisner) model and supervised Searn.

^2 SearnShell and MegaM are available at http://searn.hal3.name
and http://hal3.name/megam, respectively.

The results are reported in Table 2: accuracy on the 
training data, accuracy on the test data and the num- 
ber of iterations required. These are all averaged over 
10 runs; standard deviations are shown in small print. 
Many of the results (the non-Searn results) are copied
from (Smith & Eisner, 2005). The stopping criterion
for the EM-based models is that the log likelihood
changes by less than 10e-5. For the Searn-based
methods, the stopping criterion is that the development
accuracy ceases to increase (on the individual classification
tasks, not on the structured prediction task).

All learned algorithms outperform the random algorithms
(except Klein+Manning with random inits).
K+M with smart initialization does slightly better
than the worst of the S+E models, though the difference
is not statistically significant. It does so needing
only about a third of the number of iterations
(moreover, a single S+E iteration is slower than a single
K+M iteration). The other two S+E models do
roughly comparably in terms of performance (strictly
dominating the previous methods). One of them ("DelOrTrans1")
requires about twice as many iterations as
K+M; the other ("Trans1") requires about three times
(but has much higher performance variance). Unsupervised
Searn performs halfway between the best K+M
model and the best S+E model (it is within the error
bars for "DelOrTrans1" but not "Trans1").

Nicely, it takes significantly fewer iterations to con- 
verge (roughly 15%). Moreover, each iteration is quite 
fast in comparison to the EM-based methods (a com- 
plete run took roughly 3 hours on a 3.8GHz Opteron 
using SearnShell). Finally, we present results for the 
supervised case. Here, we see that the SEARN-based 
method converges much more quickly to a better solution
than the S+E model. Note that this comparison
is unfair since the SEARN-based model uses additional 
features (though it is a nice property of the SEARN- 
based model that it can make use of additional fea- 
tures). Nevertheless we provide it so as to give a sense 
of a reasonable upper-bound. We imagine that includ- 
ing more features would shift the upper-bound and the 
unsupervised algorithm performance up. 

7. A Semi-Supervised Version 

Figure 3. Parsing accuracy for semi-supervised, supervised
and unsupervised Searn. X-axis is: (semi/sup) # of labeled
examples; (unsup) # of unlabeled examples.

The unsupervised learning algorithm described above
naturally extends to the case where some labeled data
is available. In fact, the only modification to the algorithm
is to change the loss function. In the unsupervised
case, the loss function completely ignores the
latent structure, and returns a loss dependent only on
the "predict self" task. In the semi-supervised version,
one plugs in a natural loss function for the "latent"
structure prediction for the labeled subset of the data.
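The change of loss function can be sketched as follows. The text does not say how the two losses are combined on labeled examples; this sketch assumes they are added, with a hypothetical weight `alpha`, and it marks unlabeled examples by a latent part of `None`. All names here are illustrative.

```python
def semi_supervised_loss(true_out, pred_out, input_loss, latent_loss, alpha=1.0):
    """Loss on (y, x) pairs for a mix of labeled and unlabeled examples.

    true_out : (y, x), where y is None when the example is unlabeled
    pred_out : (y_hat, x_hat), the predicted latent structure and input
    """
    (y, x), (y_hat, x_hat) = true_out, pred_out
    total = input_loss(x, x_hat)                  # "predict self" term, always present
    if y is not None:                             # labeled example: also score the structure
        total += alpha * latent_loss(y, y_hat)
    return total
```

With every `y` set to `None` this reduces to the unsupervised loss, so the rest of the learning procedure can stay unchanged.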

In Figure 3, we present results on dependency pars- 
ing. We show learning curves for unsupervised, fully 
supervised and semi-supervised models. The x-axis 
shows the number of examples used; in the unsuper- 
vised and supervised cases, this is the total number of 
examples; in the semi-supervised case, it is the num- 
ber of labeled examples. Error bars are two standard 
deviations. Somewhat surprisingly, with only five la- 
beled examples, the semi-supervised approach achieves 
an accuracy of over 70%, only about 10% behind the 
fully supervised approach with 5182 labeled examples. 
Eventually the supervised model catches up (at about 
250 labeled examples). The performance of the unsu- 
pervised model continues to grow as more examples 
are provided, but never reaches anywhere close to the 
supervised or semi-supervised models. 

8. Conclusions 

We have described the application of a search-based 
structured prediction algorithm, Searn, to unsuper- 
vised learning. This answers positively an open ques- 
tion in the field of learning reductions (Beygelzimer 
et al., 2005): can unsupervised learning be reduced 
to supervised learning? We have shown a near- 
equivalence between the resulting algorithm and the 
forward-backward algorithm in hidden Markov mod- 
els. We have shown an application of this algorithm 
to unsupervised dependency parsing in a shift-reduce 
framework. This provides the first example of unsupervised
learning for dependency parsing in a non-probabilistic
model and shows that unsupervised shift-reduce
parsing is possible. One obvious extension of
this work is to structured prediction problems with 
additional latent structure, such as in machine trans- 
lation. Instead of using the predict-self methodology, 
one could directly apply a predict-target methodology. 

The view of "predict the input" for unsupervised
learning is implicit in many unsupervised learning approaches,
including standard models such as restricted
Boltzmann machines and Markov random fields. This 
is made most precise in the wake-sleep algorithm (Hin- 
ton et al., 1995), which explicitly trains a neural net- 
work to reproduce its own input. The wake-sleep al- 
gorithm consists of two phases: the wake phase, where 
the latent layers are produced, and the sleep phase, 
where the input is (re-) produced. These two phases 
are analogous to the predict-structure phase and the 
predict-words phase in unsupervised Searn.